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Abstract 

Linguistic  code  switching  (LCS)  occurs 
when  speakers  mix  multiple  languages 
in  the  same  speech  utterance.  We  find 
LCS  pervasively  in  hilingual  communi¬ 
ties.  LCS  poses  a  serious  challenge  to 
Natural  Language  and  Speech  Process¬ 
ing.  With  the  ubiquity  of  informal  gen¬ 
res  online,  LCS  is  emerging  as  a  very 
widespread  phenomenon.  This  paper 
presents  a  first  attempt  at  collecting  and 
annotating  a  large  repository  of  LCS  data. 

We  target  Hindi  English  (Hinglish)  LCS. 

We  investigate  the  feasibility  of  leverag¬ 
ing  crowd  sourcing  as  a  means  for  anno¬ 
tating  the  data  on  the  word  level.  This 
paper  briefly  explains  the  setup  of  the 
experiment  and  data  collection.  It  also 
presents  statistics  representing  agreements 
among  annotators  over  different  possible 
categories  of  Hinglish  words  and  analyzes 
the  confidence  with  which  a  code  switched 
word  can  be  annotated  in  the  correct  cate¬ 
gory  by  humans. 

1  Introduction 

Linguistic  Code  switching  (LCS)  is  the  term  used 
to  describe  a  common  practice  among  bilingual 
speakers  of  a  given  language  pair  in  which  the 
speakers  switch  back  and  forth  between  their  com¬ 
mon  languages.  This  phenomenon  is  dominantly 
observed  in  inhabitants  of  countries  like  India 
where  Hindi  is  a  common  first  language  (LI)  and 
English  acts  as  a  second  language  (E2)  among  na¬ 
tive  Hindi  speakers.  Eor  example,  the  following 
Hindi  sentence  with  code  switches  to  English  is 
a  seamless  example  of  North  Indian  conversation: 
Vske  communication  ki  wajah  se  hi  project  suc¬ 
cessful  hua  hai.  (Project  has  become  successful 
because  of  his  excellent  communication.)  ECS  oc¬ 
curs  both  inter-sentential  and  intra-sentential. 


ECS  occurs  in  all  genres  of  communication 
for  such  speakers,  including  spoken  conversation, 
email,  online  chat  rooms,  blogs  and  newsgroups. 
Thus,  it  seriously  impacts  attempts  to  process 
these  exchanges  computationally,  for  the  purposes 
of  automatic  translation,  speech  recognition,  and 
information  extraction,  inter  alia  (Solorio  and  Eiu, 
2008a;  Solorio  and  Eiu,  2008b). 

With  increasing  interest  in  ECS,  there  is  need 
for  large  annotated  ECS  corpora  which  can  sup¬ 
port  the  needs  of  computational  as  well  as  theo¬ 
retical  research.  This  paper  presents  one  experi¬ 
ment  where  a  corpus  of  code  switched  sentences 
is  annotated  for  identifying  code  switch  points  us¬ 
ing  crowd  sourcing  methods.  The  data  collection 
serves  as  the  first  attempt  at  creating  a  reposi¬ 
tory  for  ECS  data.  Also  the  annotations  of  ECS 
points  will  shed  light  into  the  nature  of  this  phe¬ 
nomenon  and  will  be  an  initial  building  block  for 
the  development  of  interesting  analytical  and  pre¬ 
dictive  models  for  automatic  ECS  processing  sys¬ 
tems.  It  is  widely  accepted  that  ECS  actually  fol¬ 
lows  a  certain  pattern  and  that  it  does  not  occur 
randomly.  Several  studies  in  sociolinguistics  and 
theoretical  linguistics  have  investigated  this  issue 
however  on  a  small  scale  (Poplack,  2001;  Myers- 
Scotton,  1993). 

2  Hindi  and  Hinglish 

Hindi  is  the  national  language  of  India  and  na¬ 
tive  language  of  many  parts  of  the  country.  It  has 
continuously  been  impacted  by  varied  languages 
and  dialects  of  the  country,  the  most  influential  of 
which  is  English.  English  expanded  its  roots  into 
India  from  the  time  Britain  occupied  the  country. 
It  was  initially  the  language  of  the  elite  upper  class 
but  as  the  education  system  became  widespread, 
English  spread  across  the  whole  country.  With  the 
proliferation  of  scientific  advancements  in  the  En¬ 
glish  speaking  world  and  India’s  race  for  technol¬ 
ogy  acquisition,  we  note  that  English  has  almost 
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become  an  Indian  language.  In  fact,  Indians  from 
different  parts  of  India  who  speak  mutually  unin¬ 
telligible  native  Indian  languages  use  English  as 
the  bridge  language  to  communicate.  The  perva¬ 
siveness  of  English  coupled  with  the  Hindi  edu¬ 
cation  throughout  the  country  led  to  rapid  devel¬ 
opment  of  Hinglish,  a  term  coined  to  describe  the 
use  of  Hindi  and  English  words  in  the  same  utter¬ 
ance,  Hinglish  ECS.^  Since  Hindi  is  a  morpholog¬ 
ically  rich  language,  we  even  often  observe  ECS 
occurring  on  the  morphological  level.  Hinglish 
has  become  a  widespread  phenomenon  (as  a  lan¬ 
guage  in  and  of  itself  even)  used  by  Indians  in  dif¬ 
ferent  parts  of  the  world.  It  is  obvious  that  the  con¬ 
text  switches  from  Hindi  to  English  are  very  fre¬ 
quent  and  some  words  that  are  borrowed  such  as 
Thank  You,  Please,  Crazy  are  almost  Hindi  words, 
as  they  have  become  part  of  the  Indian  native  lex¬ 
icon.  One  important  reason  that  smooth  switch 
can  occur  between  Hindi  and  English  is  that  words 
from  any  of  these  languages  can  hh  the  lexical 
gaps  in  the  sentence  of  the  other.  It  is  important  to 
point  out  that  ECS  is  beyond  nonce  and  borrow¬ 
ing,  the  phenomenon  in  Hinglish  is  that  of  signif¬ 
icant  amounts  of  words  and  chunks  are  switched 
back  and  forth  in  the  same  utterance,  it  is  not  a 
matter  of  isolated  borrowed  words  that  are  highly 
frequent  in  the  Hindi  lexicon. 

Our  paper  attempts  to  describe  an  initial  large 
scale  collection  of  ECS  data  and  annotate  it  on  the 
word/token  level.  Several  linguistic  studies  have 
investigated  Hinglish  on  a  theoretical  level  (Bhatt, 
1997;  Joshi,  1985)  as  well  as  socio-pragmatic 
level  as  in  the  work  of  Bhatt  and  Bolonyai  (2008). 
The  studies  suggest  that  ECS  occurs  in  a  sys¬ 
tematic  manner.  However  to  our  knowledge  no 
large  collection  of  ECS  data  for  Hinglish  exists, 
let  alone  detailed  annotations  for  such  a  collection. 
Our  initial  attempt  is  to  hll  this  gap  such  that  it 
would  be  of  utility  to  both  the  theoretical  linguis¬ 
tics  as  well  as  the  computational  processing  helds. 

3  Corpus  Collection 

We  needed  content  where  the  matrix  language 
was  Hindi  with  frequent  code  switches  to  En¬ 
glish.  Modern  Hindi  novels  are  rich  sources  for 
such  content  as  they  use  Hinglish  frequently.  The 
content  of  two  sites:  www.hindinovels.net  and 
www.abhivyakti-hindi.org  were  crawled  using  perl 

*For  the  purposes  of  this  paper,  we  are  not  interested  in 
the  inter  sentential  LCS. 


scripts  and  broken  into  sentences  to  develop  the 
corpus.  Some  Hindi  sentences  with  no  CS  were 
mixed  with  these  sentences  to  prepare  an  opti¬ 
mal  blend  of  sentences  for  annotations.  The  hnal 
corpus  consisted  of  10500  sentences  comprising 
193285  tokens. 

4  Experiment  Setup 

Amazon  Mechanical  Turk  (AMT)  is  a  market¬ 
place  to  host  surveys  where  requesters  host  some 
questions  which  are  answered  by  workers,  aka 
turkers.  It  has  been  widely  accepted  that  the  use 
of  crowd  sourcing  techniques  for  the  collection  of 
data  annotations  is  a  worthwhile  effort  (Snow  et 
ah,  2008).  The  beneht  of  using  crowd  sourcing 
lies  in  a  rapid  collection  cycle,  sometimes  at 
the  expense  of  quality.  Hence  the  challenge 
lies  in  designing  and  simplifying  the  task  and 
presenting  it  to  lay  people  in  generic  terms.  But 
also  setting  performance  metrics  for  accepting 
such  annotations.  We  carried  out  our  experiments 
on  AMT  where  we  asked  the  turkers  to  identify 
each  word  in  a  sentence  as  one  of  the  following 
categories:^ 

1.  Hindi-  aaya{came),gaya{went),hum{'we) 

2.  English-  usual  English  words,  for  example,  eat, 
grin,  happy 

3.  Eoreign  Proper  Name-  John,Stella,  IBM 

4.  Indian  proper  Name-  Ramesh,Ganesh,  Anjali 

5.  Unknown-  Any  word  which  can  not  be  classi- 
hed  into  any  of  the  above  categories 

The  experiment  was  set  up  as  a  survey  with 
three  Hinglish  sentences  on  one  page.  Each  of 
such  pages  is  termed  a  Human  Intelligence  Task 
(HIT)  and  a  collection  of  HITs  is  termed  a  task  on 
AMT.  Our  collection  of  10500  sentences  was  di¬ 
vided  into  7  tasks,  each  task  containing  500  HITs 
with  3  sentences  each.  Each  word  in  a  sentence 
had  a  drop  down  list  containing  the  above  options 
associated  with  it,  with  the  default  option  being 
Hindi.  The  AMT  turkers  then  marked  each  word 
in  the  sentence  as  one  of  the  options  above.  A 
minimum  of  two  turkers  were  allowed  to  work  on 
a  single  HIT  or  the  same  set  of  3  sentences  in  order 
to  allow  overlap  for  agreements/disagreements  on 
same  set  of  words.  Accordingly  ah  the  data  was  at 

^For  this  pilot  annotation,  we  did  not  include  the  more 
complex  annotation  of  mixed  Hinglish  morphology.  We  de¬ 
cided  to  postpone  that  annotation  for  a  later  phase. 
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least  doubly  annotated. 

A  subset  of  HITs  (10%  of  the  eorpus  size)  was 
gold  annotated  by  a  native  bilingual  speaker  of 
Hindi  and  English.  We  designed  the  set  up  of 
the  HITs  sueh  that  for  any  given  turker  at  least 
one  sentenee  in  a  HIT  overlapped  with  a  gold  an¬ 
notation.  Then  the  turker  whose  sampled  HIT 
annotations  agreed  with  the  gold  annotation  less 
than  95%  were  disearded.  Initially,  136  turk- 
ers  submitted  the  results,  out  of  whieh  85  turk- 
ers  seored  above  the  set  95%  threshold.  The  HITs 
that  were  rejeeted  were  resubmitted  to  AMT  for 
re-annotation.  With  resubmission  results,  8  more 
turkers  were  added  as  they  seored  above  the  95% 
threshold  bringing  the  total  number  of  turkers  to 
93.  Aeeordingly,  the  overall  data  was  annotated 
by  93  turkers,  10%  of  the  overall  10500  sentenees 
is  three  way  annotated  with  gold  annotation  and 
by  two  turkers. 

5  Experiment  Results  and  Statistics 

In  this  seetion  we  present  detailed  results  on  the 
eolleeted  annotations.  We  ealeulate  inter- turker 
agreements  based  on  how  many  times  a  turker 
agreed  on  a  eategory  within  the  same  HIT  with 
the  other  turkers  who  eo-annotated  the  same  HIT. 
The  results  for  turkers  were  then  aggregated  to 
find  the  total  number  of  agreements  for  eaeh 
eategory.  The  resulting  eonfusion  matrix  is  shown 
in  Table  1.  The  legend  for  the  table  is  as  follows: 

h-  Hindi 
e-  English 

f-  Eoreign  Proper  Name 
i-  Indian  Proper  Name 
u-  Unknown 


Eaeh  eell  of  the  eonfusion  matrix  eorresponds  to 
agreement  eounts  for  any  turker  aggregately  with 
respeetive  eo-turkers. 

The  following  detailed  statisties  show  the  per- 
eentage  elassifieation  agreement  among  the  eo- 
turkers  in  different  eategories  for  the  majority  an¬ 
notated  elass  on  the  word  level.  As  mentioned 
above,  eaeh  HIT  was  annotated  by  two  turkers. 
In  our  detailed  statisties,  we  observe  the  num¬ 
ber  of  times  two  turkers  agreed  on  a  eategory  la¬ 
bel  per  word  in  the  same  HIT.  We  report  below 
the  pereentage  of  aggregate  pairwise  agreements 


h 

e 

f 

i 

u 

h 

167195 

1875 

370 

535 

426 

e 

2546 

11800 

229 

47 

215 

f 

578 

253 

3996 

143 

45 

i 

546 

45 

120 

1467 

29 

u 

442 

212 

40 

32 

99 

Table  1:  Confusion  Matrix  of  the  aggregate  turk¬ 
ers’  annotations  for  the  different  eategories 

among  the  turkers  for  those  eategories.  We  re¬ 
port  the  results  of  the  analysis  by  the  majority 
elass.  Henee  for  those  instanees  that  are  eonsid- 
ered  Hindi  aeross  the  HITs,  98.1%  of  the  times, 
some  two  turkers  agreed  on  a  Hindi  label. 

All  in  all,  the  data  had  193285  word  instanees, 
eorresponding  to  14658  word  types,  88.16%  word 
instances  were  considered  Hindi  by  the  majority 
of  turkers,  7.67%  instances  were  considered 
English  by  the  majority  of  turkers,  2.59%  words 
were  considered  Eoreign  Proper  names,  and 
1.14%  were  considered  Indian  Proper  names, 
hnally  0.42%  were  considered  Unknowns.  The 
following  statistics  reflect  the  confusion  on  the 
majority  label  by  aggregate  pairs  of  turkers. 

Eor  majority  class  Hindi  word  instances  (88.16% 
of  the  word  instances): 

Hindi-  98.1% 

English-  1.1% 

Eoreign  Proper  Name-  0.22% 

Indian  Proper  Name-  0.31% 

Unknown-  0.25% 

Hence,  turkers  agreed  98. 1%  of  the  time  that  the 
label  for  these  88.16%  of  the  word  instances  are 
Hindi,  however,  some  set  of  the  turker  pairs  con¬ 
fused  1.1%  of  this  Hindi  data  set  as  English,  while 
0.25%  of  the  time  pairs  of  turkers  considered  these 
Hindi  words  as  Unknown. 

Eor  majority  class  English  word  instances 
(7.67%  of  the  word  instances): 

Hindi-  17.16% 

English-  79.53% 

Eoreign  Proper  Name-  1.54% 

Indian  Proper  Name-  .32% 

Unknown-  1.45% 

The  turkers  agreed  79.53%  of  the  time  that 
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the  label  for  these  7.67%  of  the  word  instanees 
are  English,  however,  some  set  of  the  turker 
pairs  eonfused  17.16%  of  this  English  data  set 
as  Hindi,  1.54%  as  Eoreign  Proper  Name,  0.32% 
as  Indian  Proper  Name  and  1.45%  of  the  time 
pairs  of  turkers  considered  these  English  words  as 
Unknown 

Eor  majority  class  Eoreign  Proper  Name  word 
instances  (2.59%  of  the  word  instances): 

Hindi-  11.52% 

English-  5.04% 

Eoreign  Proper  Name-  79.68% 

Indian  Proper  Name-  2.85% 

Unknown-  .9% 

The  turkers  agreed  79.68%  of  the  time  that 
the  label  for  these  2.59%  of  the  word  instances 
are  Eoreign  Proper  Name,  however,  some  set  of 
the  turker  pairs  confused  11.52%  of  this  Eoreign 
Proper  Name  data  set  as  Hindi,  5.04%  as  English, 
2.85%  as  Indian  Proper  Name  and  0.9%  of  the 
time  pairs  of  turkers  considered  these  Eoreign 
Proper  Names  as  Unknown 

Eor  majority  class  Indian  Proper  Name  word 
instances  (1.14%  of  the  word  instances): 

Hindi-  24.74% 

English-  2.04% 

Eoreign  Proper  Name-  5.44% 

Indian  Proper  Name-  66.47% 

Unknown-  1.31% 

Eor  majority  class  Unknown  word  instances 
(0.42%  of  the  word  instances): 

Hindi-  53.58% 

English-  25.7% 

Eoreign  Proper  Name-  4.85% 

Indian  Proper  Name-  3.88% 

Unknown-  12% 


The  above  statistics  are  the  aggregated  results, 
we  note  that  the  results  for  each  of  the  93  turk¬ 
ers  taken  individually,  as  compared  to  their  respec¬ 
tive  co-turkers  follow  the  same  trend  as  the  aggre¬ 
gated  results.  Eor  example,  if  we  compare  an  in¬ 
dividual  turker  with  co-turkers,  majority  of  agree¬ 


ments  are  Hindi-Hindi,  English-English  and  so  on. 
Similarly,  disagreements  are  also  proportionate  to 
above  statistics. 

A  detailed  token  level  analysis  also  showed  sim¬ 
ilar  trends.  We  analyzed  a  sample  of  1304  to¬ 
kens  of  which  1005  have  a  Hindi  root  and  245 
are  of  English  etymology.  27  tokens  were  Eor¬ 
eign  Proper  Names  and  26  were  Indian  Proper 
Names.  The  turkers  agreed  98.45%  times  that  the 
tokens  are  Hindi  over  the  total  occurrences  of  sam¬ 
ple  Hindi  root  tokens.  They  agreed  79.41%  times 
that  the  token  is  English  over  tokens  with  English 
root,  75.74%  times  agreed  that  the  token  is  Eor¬ 
eign  Proper  Name  for  Eoreign  Proper  Name  to¬ 
kens.  The  turkers  agreed  74.58%  times  that  the 
token  is  an  Indian  Proper  Name  for  Indian  Proper 
Name  tokens.  The  turkers  were  observed  to  con¬ 
fuse  Hindi  tokens  and  Indian  Proper  Name  tokens 
as  22.63%  times,  i.e.  they  mutually  agreed  that  the 
token  is  Hindi  when  it  was  in  fact  an  Indian  Proper 
Name. 

We  further  analyze  the  agreement  on  a  complete 
sentence  level,  where  turkers  agreed  on  the  anno¬ 
tation  for  every  token  in  the  sentence,  we  found 
only  57  such  sentence  annotations. 

6  Analysis  of  Results 

As  depicted  by  the  above  statistics,  the  largest  per¬ 
centage  of  agreement  was  for  the  words  marked  in 
the  Hindi  category.  There  was  about  98%  agree¬ 
ment  over  such  words  which  can  be  attributed  to 
two  reasons.  Eirstly,  Hindi  being  the  matrix  lan¬ 
guage,  a  dominant  part  of  words  in  the  sentences 
were  Hindi.  Secondly,  although  there  were  very 
few  instances  where  the  turkers  completely  agreed 
on  each  word  of  a  sentence,  they  had  almost  no 
confusion  in  identifying  the  Hindi  words  in  a  sen¬ 
tence. 

Eor  a  word  classified  as  English  by  a  turker, 
the  co-turkers  agreed  80%  times.  However,  about 
17%  co-turkers  confused  such  words  as  Hindi.  An 
obvious  reason  for  such  observation  is  the  fact  that 
some  of  the  English  words  have  blended  so  well 
with  Hindi  that  even  the  native  Hindi  speakers 
are  not  able  to  recognize  them  as  English  words. 
Eor  example,  English  words  such  as  cycle,  car, 
train,  plate,  bread  have  become  part  of  Hindi  lex¬ 
icons  and  the  native  speakers  unintelligibly  con¬ 
sider  these  words  as  Hindi  itself  in  their  conver¬ 
sations.  This  shows  the  seamless  mingling  of  En¬ 
glish  words  in  Hindi  to  such  an  extent  that  they  are 
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indistinguishable  as  English  words. 

There  was  agreement  for  majority  of  Foreign 
and  Indian  Proper  Names  (79.68%  and  66.47%  re- 
speetively).  The  highest  pereentage  of  disagree¬ 
ments  were  observed  when  the  eo-turkers  marked 
proper  names  as  Hindi  words.  This  may  be  at¬ 
tributed  to  the  faet  that  eapitalization  does  not  ex¬ 
ist  in  the  Hindi  seript  for  proper  names,  and  they 
might  be  miseonstrued  as  other  parts  of  speeeh. 
For  example,  Pawan  eould  be  a  name  and  eould 
be  used  as  a  noun  meaning  air  as  well.  Similarly, 
Anant  eould  be  used  as  a  name  or  an  adjeetive 
meaning  with  out  an  end. 

The  Unknown  eategory  showed  some  interest¬ 
ing  results.  The  agreement  over  Unknown  eate¬ 
gory  was  mueh  less  among  turkers.  Instead,  ma¬ 
jority  of  eo-turkers  marked  sueh  words  as  Hindi 
words  (53.58%  times).  After  analysis,  it  was 
found  that  a  major  reason  for  this  observation  was 
because  turkers  were  confused  on  morphologi¬ 
cally  mixed  words.  For  example,  plural  of  com¬ 
pany  after  morphological  adjustment  becomes 
companiyon  in  Hinglish.  A  turker  was  not  able  to 
classify  such  words  distinctly  since  it  is  half  Hindi 
and  half  English  from  his  point  of  view.  More¬ 
over,  we  believe  that  the  fact  that  we  have  a  default 
Hindi  tag,  could  have  contributed  to  the  confusion. 
In  our  next  iteration  of  annotation  experiments,  we 
will  make  sure  to  avoid  a  default  tag. 

If  we  consider  the  overall  results,  more  than 
90%  of  agreements  were  for  Hindi  words  followed 
by  English,  Foreign  Proper  Name,  Indian  Proper 
Name  and  Unknown  categories,  in  that  order.  This 
along  with  results  for  each  of  the  individual  cate¬ 
gories  shows  that  the  turkers  have  high  confidence 
while  marking  the  Hindi  words.  English  words, 
foreign  proper  names  and  Indian  proper  names 
also  show  good  confidence  wifh  agreemenf  over 
majorify  of  fhem.  Majorify  of  disagreemenfs  in 
differenf  cafegories  were  classified  as  Hindi  which 
shows  fhe  fendency  of  fhe  furkers  fo  mark  fhe 
word,  abouf  which  fhey  are  confused,  as  Hindi  if- 
self,  or  simply  leave  if  as  defaulf.  Based  on  anal¬ 
ysis  of  resulfs  above  in  conjuncfion  wifh  fhe  infer- 
furker  and  gold  agreemenfs,  if  can  be  affirmed  wifh 
high  confidence  fhaf  aparf  from  fhe  Unknown  caf- 
egory  words,  fhe  furkers  converge  on  fhe  correcf 
cafegory  significanfly  above  chance  indicafing  fhe 
feasibilify  of  fhe  approach. 


7  Conclusion  and  Future  Directions 

In  fhis  paper  we  presenfed  an  inifial  affempf  af 
building  a  large  scale  repository  of  manually  anno- 
fafed  ECS  dafa  for  Hinglish.  We  believe  we  have 
esfablished  fhaf  crowd  sourcing  is  a  good  mefhod 
for  inducing  such  annofafions.  In  fhe  near  fufure 
we  plan  on  annofafing  more  dafa.  We  plan  on 
adding  a  new  cafegory  label  of  mixed  morphology. 
Finally,  we  infend  fo  perform  fhe  same  annofafion 
fask  for  ofher  language  pairs. 
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